1. Introduction


The problem of sorting N numbers with a fixed-connection network has a long and rich history
[1-30]. For the most part, the complexity of parallel sorting has been measured in terms of the
number of processors used and the number of parallel operations performed. In conjunction with
the development of distributed computing and very large scale integration (VLSI), the complexity
of parallel sorting has also been measured in terms of required information transfer and chip area.
Determining the complexity of sorting in these four measures has remained a difficult problem
for some time. Recently, however, several significant advances have been made. In some cases
(particularly the breakthrough work of Ajtai, Komlos and Szemeredi [2]), tight bounds have been
proved. In other cases (most notably [24] and [28]), methods have been developed that almost
lead to tight bounds and that substantially increase our knowledge of the problem.

In this paper, we combine the work of [2] and [28] with new methods to precisely determine
the complexity of sorting with fixed-connection networks. Our results and their relevance to
previous work are described in the remainder of the introduction. The proofs are contained in
Sections 2 through 5. In particular: Section 2 contains a description of a simple sorting algorithm
that we call columnsort, Section 3 contains proofs of the bounds for the number of nodes needed
to sort, Section 4 contains proofs of the bounds for information transfer and area, and Section
5 describes area-optimal, small-constant-factor networks. We conclude with some remarks and
directions for future research in Section 6.

1.1  Two  Fundamental  Sorting  Problems 

Much of the work on parallel sorting has been directed towards solving the following two
problems.

Problem 1: Construct an O(log N)-level circuit that sorts N numbers.

An example of a 3-level circuit that sorts 4 numbers is shown in Figure 1. In general, each
level consists of N/2 disjoint comparators. Each comparator can be viewed as an edge that
possibly exchanges the numbers at its endpoints so that the bigger number exits at the endpoint
marked B and so that the smaller number emerges at the other endpoint (marked S). After
passing through all the levels (from left to right), the numbers emerge from the circuit in sorted
order. As the comparisons and exchanges in each level are performed simultaneously, the total
time required to sort the N numbers is equal to the number of levels in the circuit. Figure 2
illustrates the sorting process for the list 4, 7, 2, 9. In this example, 2 and 4 are exchanged in
the first level, and 4 and 7 are exchanged in the third level.
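
The level-by-level behavior described above can be sketched in a few lines of Python. The wiring below is an assumption: Figure 1 is not reproduced here, so the comparator positions in LEVELS are simply one 3-level layout that is consistent with the exchanges reported for the list 4, 7, 2, 9. The names LEVELS, run_circuit and sorts_everything are illustrative only.

```python
from itertools import permutations

# Assumed wiring (not taken from Figure 1): each level is a list of
# (low, high) wire pairs. After a comparator fires, the smaller value
# sits on wire `low` and the larger value on wire `high`.
LEVELS = [
    [(0, 2), (1, 3)],  # level 1
    [(0, 1), (2, 3)],  # level 2
    [(1, 2)],          # level 3
]


def run_circuit(values, levels=LEVELS):
    """Return the wire contents before level 1 and after every level."""
    wires = list(values)
    trace = [tuple(wires)]
    for level in levels:
        # Comparators within one level touch disjoint wires, so applying
        # them one after another matches applying them simultaneously.
        for low, high in level:
            if wires[low] > wires[high]:
                wires[low], wires[high] = wires[high], wires[low]
        trace.append(tuple(wires))
    return trace


def sorts_everything(n, levels=LEVELS):
    """Check the circuit on every permutation of 0..n-1."""
    return all(
        list(run_circuit(p, levels)[-1]) == sorted(p)
        for p in permutations(range(n))
    )
```

Calling run_circuit([4, 7, 2, 9]) yields the states (4, 7, 2, 9), (2, 7, 4, 9), (2, 7, 4, 9), (2, 4, 7, 9): an exchange of 2 and 4 in the first level and of 4 and 7 in the third. The number of entries in the trace minus one is the number of levels, which is the sorting time in this model.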

Problem 2: Construct a bounded-degree, O(N)-node network that sorts N numbers in
O(log N) steps.

An example of a 4-node, degree-3 network that sorts 4 numbers in 3 steps is shown in Figure 3.
As in Problem 1, the edges serve to transmit numbers between processors. In Problem 2, however,
the nodes are active throughout the algorithm and must be equipped with a local control telling
them what to do at each step. In general, the control might be quite complex (depending on the
time, the position of the node in the network, the value of N, as well as the history of everything
the node has seen and done). For our purposes, however, it will be sufficient to consider local
controls that can be described with a constant number of bits. In the example shown in Figure 3,
each node corresponds to a row of Figure 1 and each edge corresponds to a comparator in Figure
1. In this case, each local control tells its node which edge should be used as a comparator at
each step. Level 1 edges are used at the first step, level 2 edges at the second step, and level 3
edges at the third step. After 3 steps, the numbers are sorted.

Early work on Problems 1 and 2 led to the construction of a Θ(log² N)-level sorting circuit [4]
and to the construction of an N-node, degree-3 network that sorts N numbers in Θ(log² N) steps
[26]. Both constructions are based on the butterfly implementation of odd-even merge sort (e.g.,
see [18, 29]). More recently, Ajtai, Komlos and Szemeredi [2] solved Problem 1 by constructing
an O(log N)-level sorting circuit. (Henceforth, we will refer to this circuit as the AKS sorting
circuit.) This result also provided partial solutions to Problem 2; namely an N-node, O(log N)-
degree network that sorts in O(log N) steps, and a Θ(N log N)-node bounded-degree network
that sorts in O(log N) steps. In both cases, the resulting network for Problem 2 has O(N log N)
edges.

The only other improvement of the initial Θ(log² N)-step bound for Problem 2 is due to Reif
and Valiant [24], who constructed an N-node, bounded-degree network that can sort in O(log N)
steps with high probability provided that each node is allowed to maintain an O(log N)-number
queue. If (as is common) unbounded-size queues are not allowed, then the number of nodes in the
Reif-Valiant construction may have to be increased by a factor of Θ(log N) to simulate the queues.
Hence the construction may really require Θ(N log N) nodes. In addition, the time requirement
might have to be increased by a factor of Θ(log log N) to manage the queues. (Whether or not
the factor of Θ(log N) blowup in the number of nodes and of Θ(log log N) in time is really needed
to simulate the queues is not known. Very recent work by Pippenger, however, suggests that the
blowups may not be needed, at least for a modified version of the algorithm [23].) Moreover, it
is known that the randomness assumption cannot be removed from the Reif-Valiant algorithm
since Borodin and Hopcroft [9] showed that any such algorithm requires Ω(√N) steps in the
worst case.

In Theorem 1 of this paper, we show that any solution for Problem 1 can be simply transformed
into a solution for Problem 2, thereby extending the work of [2] to solve Problem 2. In fact, we
prove that any depth-T sorting circuit can be transformed into an N-node, degree-3 network that
sorts N numbers in O(T) steps. The proof of this simple yet unexpected fact combines a standard
pipelining argument with a generalization of odd-even merge sort that we call columnsort. The
details are provided in Sections 2 and 3.

1.2  The  Bit  Model  of  Computation 

Problem 2 can be reformulated in a variety of ways. One natural formulation restricts each
node to have finite memory and control. Such is the case for the bit model of parallel computation,
where each node has just c bits of state for some constant c that is independent of N. In the
bit model, each bit of each input number and of each sorted number is treated individually.
During a single bit step, each node can perform a constant number of bit-size operations (such
as a compare). As a consequence of these restrictions, it is not possible to store a Θ(log N)-bit
number in a single node, nor is it possible to compare two Θ(log N)-bit numbers in a single bit
step. Hence the bit model is far more restrictive than the corresponding word model, for which
Problems 1 and 2 were originally defined. (Two numbers can be compared in a single word step
in the word model, for example.)

It is well known [12] that any O(log N)-level sorting circuit in the word model can be
transformed into an O(N log N)-node, bounded-degree network that sorts N O(log N)-bit numbers in
O(log N) bit steps in the bit model. In fact, the two networks are the same. Instead of passing
whole numbers at a time from left to right, the bit model version of the network passes the
numbers bit-by-bit from left to right, most significant bits first. Each comparator sees two numbers
bit-by-bit, leading bits first. As long as the leading bits of the two numbers are identical, the
comparator simply passes the bits through. (Of course, the output would be the same if the
comparator were exchanging the bits as long as the leading bits are identical.) Once bits are
found that are different, the comparator knows instantly which number is bigger and henceforth
sends all remaining bits of the bigger number to the node marked B and all remaining bits of the
smaller number to the node marked S. The total time taken is the sum of the number of levels
in the circuit and the length of the bit strings for each number. For O(log N)-bit numbers and
an O(log N)-level circuit, at most O(log N) bit steps are used.
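
The bit-serial comparator just described needs only a constant amount of state: whether the two streams have differed yet and, if so, which one is bigger. The following Python sketch illustrates this under stated assumptions; the function names and the three-valued state variable are hypothetical and are not taken from [12].

```python
def to_bits(x, width):
    """Most-significant-bit-first list of the low `width` bits of x."""
    return [(x >> (width - 1 - i)) & 1 for i in range(width)]


def from_bits(bits):
    """Inverse of to_bits."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value


def bit_serial_comparator(a_bits, b_bits):
    """Compare-exchange two equal-length bit streams, leading bits first.

    Returns (small_bits, big_bits), the streams leaving the endpoints
    marked S and B. One pair of bits is consumed per bit step.
    """
    state = "undecided"  # constant-size local state: undecided / pass / swap
    small, big = [], []
    for a, b in zip(a_bits, b_bits):
        if state == "undecided" and a != b:
            # First differing bit: the stream holding the 1 is the bigger number.
            state = "swap" if a > b else "pass"
        if state == "swap":
            small.append(b)
            big.append(a)
        else:
            # Either still undecided (bits are equal) or a is the smaller number.
            small.append(a)
            big.append(b)
    return small, big
```

For example, feeding the 4-bit streams for 6 and 5 into the comparator returns the streams for 5 and 6. Because each output bit is produced in the same step in which the input bits arrive, consecutive levels can overlap in time, which is where the "levels plus bit length" total comes from.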

It is natural to ask whether or not Ω(N log N) nodes are really needed to sort N O(log N)-bit
numbers. For example, it might be possible to sort with O(N) nodes, particularly if more time is
allowed. In Theorem 2, we show that this is not the case. In fact, we show that no matter how
much time is allowed, Ω(N log N) nodes are required to sort N (2 log N)-bit numbers. The result
can be extended to sorting k-bit numbers where k ≥ (1 + ε) log N for any ε > 0, but cannot
be extended for values of k ≤ log N since N (log N)-bit numbers can be sorted (given enough
time) using O(N) nodes [25]. The proof of the lower bound requires that each input is provided
just once and that the input/output schedule is where and when oblivious. The arguments used
are similar to those used by Ullman [29] to prove an Ω(N) lower bound on the number of nodes
needed to sort N (log N + 1)-bit numbers.

In case it wasn’t already obvious, the proof of the Ω(N log N) lower bound on the number
of processors necessary to sort N numbers in the bit model also provides an Ω(N) lower bound
on the number of processors necessary to sort in the word model. Taken together, the results of
Section 3 provide tight bounds on the number of nodes needed to sort in both the bit and word
models. In addition, both lower bounds are achieved with O(log N)-time algorithms (the best
possible).


1.3  Information  Transfer  and  Area  Bounds 


With the development of VLSI technology, wire area has become an important measure of
a problem’s difficulty. Whereas there is no relationship between the number of nodes and the
number of steps necessary to sort N numbers, there is a relationship between the wire area and
the number of steps. In the word model, it is well known that AT² ≥ Ω(N²) where A is the
minimum area of any network that sorts in T steps [27, 28]. For T = O(log N), this means
that A ≥ Ω(N²/log² N). In Theorem 3, we show that the network constructed in Section 3
achieves this bound. Moreover, the construction can be modified (without increasing the number
of nodes, O(N)) to achieve the AT² lower bound for any T in the range Ω(log N) ≤ T ≤ O(√N).
Formerly, such results were known only for T ≥ Ω(log³ N) [6].
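
As a worked instance of the word-model tradeoff, the following only substitutes the two ends of the stated range of T into the AT² bound; it is a sketch of the arithmetic rather than an additional result.

```latex
\[
AT^2 \ge \Omega(N^2)
\;\Longrightarrow\;
A \ge \Omega\!\left(\frac{N^2}{T^2}\right),
\qquad
T = \Theta(\log N):\; A \ge \Omega\!\left(\frac{N^2}{\log^2 N}\right),
\qquad
T = \Theta(\sqrt{N}):\; A \ge \Omega(N).
\]
```

Assuming each node occupies at least constant area, an O(N)-node network cannot have area below the order of N, so allowing T to grow beyond the order of √N cannot reduce the area any further.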

Of more practical interest is the AT² tradeoff for sorting networks in the bit model. Using
crossing sequence techniques, several researchers have shown that the information transfer
necessary to sort N (log N + 1)-bit numbers is at least Ω(N). The information transfer or, equivalently,
communication complexity of a problem is the minimum number of bits that must pass between
the left and right halves of the chip during a worst-case computation (provided that each “half”
of the chip outputs half of the bits of the sorted numbers). Since the square of the information
transfer is a lower bound for AT² [27], we know that AT² ≥ Ω(N²). Angluin and Thompson
[3] (and later El Gamal [10]) improved this bound to AT² ≥ Ω(N² log N) and Thompson
conjectured that the true bound is AT² ≥ Ω(N² log² N). Although the intuition for the stronger
bound is clear, it can be misleading. For example, the same intuition also leads to the conjecture
that AT² ≥ Ω(N²k²) for sorting N k-bit numbers where k ≫ Ω(log N). The latter conjecture
is false, however. In fact, the results in this paper can be used to construct networks for sorting
N  /c-bit  numbers  that  achieve  an  AT 2  bound  of  0(N2k\ogN  \og2[i~j^))  which  is  significantly 
less than Ω(N²k²) for large k [18].

In Theorem 4, we verify that AT² ≥ Ω(N² log² N) by showing that the information transfer
necessary to sort N (7 log N)-bit numbers is Ω(N log N). As before, this bound can be achieved
for all T in the range Ω(log N) ≤ T ≤ O(√(N log N)). For each T, the network contains just
O(N log N) nodes, the fewest possible. Previous constructions for achieving these bounds were
limited to the case when T ≥ Ω(log³ N) [6].
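
The step from information transfer to area, and the upper end of the range of T, can be spelled out as follows. This is a sketch that only combines the facts stated above; the assumption that each node occupies at least constant area is the usual VLSI convention and is not restated in this section.

```latex
\[
I \ge \Omega(N \log N)
\quad\text{and}\quad
AT^2 \ge \Omega(I^2)
\;\Longrightarrow\;
AT^2 \ge \Omega(N^2 \log^2 N),
\qquad\text{so}\qquad
A \ge \Omega\!\left(\frac{N^2 \log^2 N}{T^2}\right).
\]
\[
\text{With } \Omega(N \log N) \text{ nodes, } A \ge \Omega(N \log N), \text{ and }
\frac{N^2 \log^2 N}{T^2} = N \log N
\iff T = \sqrt{N \log N}.
\]
```

Beyond T of the order of √(N log N) the node count, not the AT² bound, is the binding constraint on area.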


1.4  Applications 

With the advent of VLSI technology, it has become possible to fabricate large numbers of
simple processors and to integrate them into large-scale, highly-parallel computers. In some
circles, such machines are called supercomputers. Examples include the M.I.T. Connection
Machine and the N.Y.U. Ultracomputer. For the most part, the architectures of these machines
are based on well-known fixed-connection networks such as the hypercube, shuffle-exchange graph,
cube-connected cycles and the FFT butterfly. The reason that these networks are used is that
they can support fast algorithms for interprocessor communication and routing of data. In fact,
most of these machines will spend most of their time trying to get the right data to the right
processor at the right time. Hence a good solution to the data routing problem is critical to the
successful construction of supercomputers.

In a fundamental paper [30], Valiant and Brebner formalized the importance of data routing
by showing that if an N-processor fixed-connection network could solve an M-packet routing
problem in T(M) steps, then it could simulate any M-processor parallel machine with only a
T(M)-factor time delay. If M = N and T(N) = O(log N), such a network could be reasonably
called universal since it could simulate any other parallel machine with the same number of
processors (regardless of the interconnection architecture) with only an O(log N)-factor time
delay, the best possible.

Although good algorithms are known for fixed-permutation [21] and random-permutation [23,
30] data routing problems, the general many-one data routing problem is still best solved as a
special case of sorting [4, 18, 29], which is why parallel sorting is so important. In fact, were a
good solution to Problem 2 found for N = 10^6 (the number of processors in currently planned
machines), it would probably provide an excellent basis for the architecture of a supercomputer.
Unfortunately, all of the constructions described thus far in the paper utilize the AKS sorting
circuit which behaves terribly for “small” values of N (e.g., N < 10^100). Although variations of
the AKS construction have been proposed [22], they are still a long way from being considered
practical.

In Section 5 of this paper, we use columnsort to construct O(log N)-time “small-constant-
factor” sorting networks that do not depend on the AKS construction. These networks are area-
optimal but they require more than the optimal number of nodes. Unfortunately, researchers
who are building supercomputers appear to be constrained by processing time and the number
of nodes as well as by wire area. Hence the networks in Section 5 are probably still not practical.

Recently, however, we have discovered probabilistic versions of columnsort for which it
appears possible to sort (1 − ε)N numbers in nearly 2 log N log log N word steps on an N-node
shuffle-exchange graph. Although we are still working out the details, it seems quite possible
that the algorithm will improve the traditional log² N time bound by an order of magnitude for
N = 10^6. The details of this work will be reported when completed [19]. Similar observations
have been made by Ajtai [1].


1.5  Summary 

The results described in this paper complete the asymptotic characterization of the number
of nodes, number of steps, amount of information transfer and amount of area needed to sort N
Θ(log N)-bit numbers in a bounded-degree, fixed-connection network in both the bit and word
models. Up to the AT² constraint, all of the lower bounds can be achieved simultaneously by
a single construction for each model. In addition, the constructions are of the small-constant-
factor variety when optimality in the number of nodes is sacrificed. For easy reference, we have
summarized the bounds in Table 1. (It should be noted that the lower bound on the number of
nodes needed in the bit model holds only when the number of bits per number exceeds (1 + ε) log N
for some constant ε > 0, and that the lower bound on the amount of information transfer needed
in the bit model is only known to hold when the number of bits per number exceeds 7 log N.)


Table 1

Bounds on the Complexity of Parallel Sorting

                                     Word Model    Bit Model

Number of Nodes                      Θ(N)          Θ(N log N)

Information Transfer                 Θ(N)          Θ(N log N)

AT² (for Ω(log N) ≤ T ≤ O(√N))       Θ(N²)         Θ(N² log² N)
2. Columnsort

In this section, we describe a simple sorting algorithm that we call columnsort. The algorithm
is a generalization of odd-even merge, but for simplicity, we first describe the algorithm as a
series of elementary matrix operations. The relationship with odd-even merge will be made clear
later.

Let Q be an r × s matrix of numbers where rs = N, s | r and r ≥ 2(s − 1)². Initially, each
entry of the matrix is one of the N numbers to be sorted. After completion of the algorithm,
the i,j entry (0 ≤ i < r, 0 ≤ j < s) of Q will contain the pth sorted number (0 ≤ p < N)
where p = i + jr. For example, Figure 4 illustrates a typical matrix before and after sorting.
(For simplicity, we have chosen a 6 × 3 matrix to illustrate the algorithm even though it does
not satisfy the constraint that r ≥ 2(s − 1)². We will discuss the relevance of this constraint and
the degree to which it can be relaxed later.)

Columnsort has eight steps. In Steps 1, 3, 5 and 7, the numbers within each column are
sorted. (Just how we accomplish this will depend on the application and does not matter for the
analysis in this section.) In Steps 2, 4, 6 and 8, the entries of the matrix are permuted. The
permutation in Step 2 (shown for a 6 × 3 matrix in Figure 5) corresponds to a “transpose” of
the matrix. The entries are picked up column by column and then deposited row by row (always
going from top to bottom in a column and from left to right in a row). The permutation in Step
4 is the inverse of that in Step 2. The permutation in Step 6 corresponds to an ⌊r/2⌋-shift of the
entries and is shown for a 6 × 3 matrix in Figure 6. The permutation in Step 8 is the inverse of
that in Step 6. The step-by-step application of columnsort to the matrix in Figure 4 is shown in
Figure 7.
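
The eight steps can be sketched in Python as follows. This is a minimal sequential illustration, not the network construction used later. Two choices are assumptions made for the sketch: the matrix is stored as a flat list in column-major order (so entry i,j sits at index i + jr), and the shift in Step 6 is realized by padding with −∞ and +∞ sentinels so that the shifted entries occupy s + 1 columns. Column sorting uses Python's built-in sorted, since the text leaves the column-sorting method open.

```python
NEG_INF, POS_INF = float("-inf"), float("inf")


def sort_columns(flat, r):
    """Steps 1, 3, 5, 7: sort each length-r column of a column-major list."""
    out = []
    for start in range(0, len(flat), r):
        out.extend(sorted(flat[start:start + r]))
    return out


def transpose(flat, r, s):
    """Step 2: pick entries up column by column, deposit them row by row."""
    out = [None] * (r * s)
    for k, value in enumerate(flat):   # k-th entry in column-major reading order
        row, col = divmod(k, s)        # k-th slot in row-major writing order
        out[col * r + row] = value
    return out


def untranspose(flat, r, s):
    """Step 4: inverse of the Step 2 permutation."""
    out = [None] * (r * s)
    for k in range(r * s):
        row, col = divmod(k, s)
        out[k] = flat[col * r + row]
    return out


def columnsort(values, r, s):
    """Sort r*s values; the result is in column-major order (index i + j*r)."""
    assert len(values) == r * s and r % s == 0 and r >= 2 * (s - 1) ** 2
    q = sort_columns(list(values), r)                   # Step 1
    q = transpose(q, r, s)                              # Step 2
    q = sort_columns(q, r)                              # Step 3
    q = untranspose(q, r, s)                            # Step 4
    q = sort_columns(q, r)                              # Step 5
    half = r // 2
    q = [NEG_INF] * half + q + [POS_INF] * (r - half)   # Step 6: floor(r/2)-shift
    q = sort_columns(q, r)                              # Step 7 (s + 1 columns)
    return q[half:half + r * s]                         # Step 8: undo the shift
```

For instance, columnsort(list(range(54, 0, -1)), 18, 3) returns the numbers 1 through 54 in increasing order. The assertion at the top of columnsort enforces the three conditions on r and s stated above.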
Figure 4: A 6 × 3 matrix before (a) and after (b) sorting.
The number of nodes in the construction of Theorem 6 is easily seen to be at most
O^N  +  -a  By  using  many  levels  of  columnsort,  the  number  of  nodes  can  be  reduced 

to O(N + N ) for any constant ε > 0, without increasing the running time or the area by
more than a constant factor. (In fact, the increase in time is polynomial in 1/ε.)


6.  Remarks 

Ideas similar to those used to develop columnsort can also be found in the work of Haggkvist
and Hell [11]. It is likely that columnsort itself has also been discovered, although we don’t
know of any references. Judging from the applications developed in this paper, it is clearly an
important technique that merits further study. The most important open question at this point
is whether or not the technique can be used recursively to construct a simple O(log N)-level
sorting circuit. As yet, we have not seen how to do this, but there are many ways in which the
technique can be applied. For example, the technique works equally well in a setting in which
lists of numbers are closesorted (i.e., in a setting where every number is mapped close to its final
position). This is similar to but stronger than the notion of nearsort employed successfully by Ajtai,
Komlos and Szemeredi [2]. As another example, the technique might work well when
combined with some sort of recursive merge procedure. Lastly, the technique seems to work well
in a probabilistic setting. At this point, it seems likely that a probabilistic circuit with depth
near 2 log N log log N can be constructed using columnsort-like ideas. (Similar observations have
also been made by Ajtai [1].) There also seems to be plenty of room for improvement.

In addition to the questions relating to small-constant-factor o(log² N)-depth circuits, it would
be interesting to pin down the bounds for numbers of nodes, area and time for sorting N k-bit
numbers for values of k not covered by the results in this paper. Substantial progress along these
lines was recently made by Siegel [25].
columnsort must be increased by a factor of log N for the same reason. As a result, the number
of nodes is increased by a factor of Θ(log N) and the area is increased by a factor of Θ(log² N).
The time is unchanged. Hence by Theorem 3, the network has O(N log N) nodes, O(N² log² N / T²)
area and sorts in O(T) steps.

For the case when Ω(√N) ≤ T ≤ O(√(N log N)), we modify the construction in Theorem 3
for  T  =  Q{VN)  by  creating  —  times  as  many  subnetworks  to  sort  the  columns,  and  by 

increasing  the  capacity  of  the  n/TV  interblock  wires  in  the  permuters  by  a  factor  of  * N . 
As a result, the number of nodes is increased from O(N) by a factor of Θ(log N) to a total
of Θ(N log N), the area is increased from Θ(N) by a factor of Θ(N log² N / T²) to a total of
Θ(N² log² N / T²), and the time has increased from Θ(√N) by a factor of Θ(T/√N) to a total
of O(T), as claimed. Notice that we cannot decrease A further due to the lower bound on area
induced by the number of nodes (Theorem 2). ∎


5.  Small-Constant-Factor  Networks 

All of the sorting networks described thus far involve the AKS sorting network. Although
this circuit performs well asymptotically, it performs very poorly for any feasible value of N. In
this section, we describe networks that are optimal in every respect except the number of nodes,
and that do not use the AKS circuit. As a result, the networks constructed in this section will
be what we refer to as small-constant-factor layouts (i.e., the true quantities are less than, say,
ten times the quantities stated in the Big Oh notation). For simplicity, we will only derive the
construction for the word model since the construction for the bit model will be nearly identical.

Theorem 6: For any T in the range log N ≤ T ≤ √N, there is a small-constant-factor,
bounded-degree network with O(N²/T²) area that can sort N numbers in O(T) word steps.

Proof: The construction is similar to that for Theorem 3, except that several meshes of trees
are used to sort columns instead of a single AKS circuit. In [14, 15, 16], we showed how to sort
M numbers in O(log M) word steps using an O(M²)-node, O(M² log² M)-area mesh of trees. In
this application, we use log² N meshes of trees each of size sufficient to sort N/(T log² N) numbers.
This collection of meshes of trees has at most O(N²/T²) area and (by pipelining) is capable of
sorting T log² N (N/(T log² N))-number columns in O(T) steps. Hence they can be used in conjunction
with an (N/(T log² N)) × T log² N columnsort to achieve the desired bounds. For T near or greater than
N^(1/3), two levels of columnsort are needed, just as in the proof of Theorem 3. ∎
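
The area accounting in the proof can be checked by direct substitution. The sketch below takes the column length to be M = N/(T log² N) and the number of meshes to be log² N; both values are assumptions about the intended parameters, chosen because they make the stated O(N²/T²) total come out.

```latex
\[
\underbrace{\log^2 N}_{\text{meshes}} \cdot
\underbrace{O\!\left(M^2 \log^2 M\right)}_{\text{area per mesh}}
= \log^2 N \cdot O\!\left(\frac{N^2}{T^2 \log^4 N}\,\log^2 M\right)
\le O\!\left(\frac{N^2}{T^2}\right),
\qquad M = \frac{N}{T \log^2 N},\ \log M \le \log N .
\]
\[
r \ge 2(s-1)^2 \ \text{with}\ r = \frac{N}{T\log^2 N},\ s = T\log^2 N
\quad\text{requires roughly}\quad
T^3 \le \frac{N}{2\log^6 N}.
\]
```

The second line shows why a single level of columnsort stops working as T approaches N^(1/3): the columns become too short relative to their number.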

Although we did not analyze the constant factors in the proof of Theorem 6, they are
not  large.  In  fact,  by  using  more,  smaller  meshes  of  trees  and,  say,  a  X  Tlog* TV 

columnsort, the sorting part of the circuit can be made to have only o(N²/T²) area! The only
major contribution to the area in such a network is the wires that permute the data before and
after it is sorted in the columns, and they are easy to lay out.
By the first lower bound proved for I, this means that I ≥ Ω(N log N). Hence, we can assume in
what follows that there are at least 2 log N bit positions of each type.

Divide the 6 log N least significant bit positions into two contiguous segments so that one
segment (say the one containing the most significant bits) contains at least log N bit positions
labeled L and so that the other segment contains at least log N positions labeled R. This can
be done by scanning the bit position labels from left to right (most significant positions first)
until one of the labels (say L) has been seen log N times. Partitioning the bit positions at this
point gives log N L labels in the most significant segment and at least log N R labels in the
least significant segment. Next fix all input bits to zero except for those in the log N positions
labeled L in the most significant segment and those in log N of the positions labeled R in the
least significant segment. Fix the bits in the log N positions labeled R in the least significant
segment so that the ith input number contains the binary representation of i in these positions
(0 ≤ i < N). Let the input bits in the log N positions labeled L in the most significant segment
vary to induce all N! permutations of the values in the log N bit positions labeled R in the least
significant segment. We are now almost done.

By providing the right half of the network with the values of the least significant log N output
bits in positions labeled R that it is not required to output, the right half of the network must
be able to produce all N! combinations of the least significant output bits correctly. Aside from
the information transfer I, the right half sees at most

\[
Z + \sum_{\substack{\log N < j \le 7 \log N \\ r_j < l_j}} \min(l_j, r_j)
\]

nontrivial bits of the input. Hence, we know that

\[
\sum_{\substack{\log N < j \le 7 \log N \\ l_j < r_j}} \min(l_j, r_j)
\;+\; I \;+\; Z \;+\;
\sum_{\substack{\log N < j \le 7 \log N \\ r_j < l_j}} \min(l_j, r_j)
\;\ge\; \log(N!).
\]

Simplifying and applying two of the previous lower bounds for I gives 5I ≥ log(N!) and thus
I ≥ Ω(N log N), as claimed. ∎
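
The last step relies on log(N!) growing like N log N. A quick numerical check can be sketched as follows, assuming base-2 logarithms; the helper names are hypothetical, and math.lgamma is used only to avoid computing N! explicitly.

```python
import math


def log2_factorial(n):
    """log2(n!) computed via the log-gamma function, since lgamma(n + 1) = ln(n!)."""
    return math.lgamma(n + 1) / math.log(2)


def ratio(n):
    """log2(n!) divided by n * log2(n); stays below 1 and tends to 1 as n grows."""
    return log2_factorial(n) / (n * math.log2(n))
```

For N = 1024 the ratio is already above one half, so 5I ≥ log(N!) forces I to be at least a constant fraction of N log N in this range, in line with the Ω(N log N) conclusion.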

The preceding result can be extended for sorting N k-bit numbers for k < 7 log N, but we
don’t know how to prove it for k = (1 + ε) log N. These lower bounds can, in fact, be achieved
by a variety of networks, as we show in the following theorem.

Theorem 5: For any T in the range log N ≤ T ≤ √(N log N), there is an O(N log N)-node
bounded-degree network with O(N² log² N / T²) area that is capable of sorting N O(log N)-bit numbers
in O(T) bit steps.

Proof: The construction for the case when T = O(√N) is nearly identical to that in Theorem
3. The first difference is that log N times as many subnetworks are needed to sort the columns,
since the throughput for bit serial computations is slower by a factor of Θ(log N). In addition,
we must pipeline the bits of each word in the manner described in Section 1.2. The second
difference is that the capacity of the √N interblock wires in the permuters in Steps 2, 4, 6 and 8 of
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checks  overall.  This  means  that  at  least  one  of  the  N  rows  contains  at  least 


$$\sum_{j = 1 + \log N}^{7 \log N} \min(l_j, r_j)$$

checks. Let $p$ be the shift corresponding to this row and fix the first $\log N$ bits of each input
number to force a $p$-shift of the less significant bits. Without loss of generality, at least

$$\frac{1}{2} \sum_{j = 1 + \log N}^{7 \log N} \min(l_j, r_j)$$

of the less significant input/output bit pairs are input in the left half of the network and output
in the right half. Set all the other input bits to zero. The network has now been reduced to
accepting a string of

$$\frac{1}{2} \sum_{j = 1 + \log N}^{7 \log N} \min(l_j, r_j)$$

input bits in the left half and outputting the same string in the right half. By a straightforward
crossing sequence argument (e.g., see [20]), this means that

$$I \ge \frac{1}{2} \sum_{j = 1 + \log N}^{7 \log N} \min(l_j, r_j),$$

as claimed.

As a special case of the preceding analysis, we can also prove that $I \ge \frac{1}{2}Z$ where $Z$ is the
number of $i,j$ pairs ($j > \log N$) for which $x_{ij}$ and $y_{ij}$ are in different halves of the network. This
is easily seen by fixing the first $\log N$ bits of each input to force a 0-shift of the less significant
bits, and then following the same argument as before.

Since we will not use the first $\log N$ bits of the numbers henceforth, we fix them to be zero for
the remainder of the proof. For $j > \log N$, label the $j$th bit position L or R depending on whether
most of the $N$ $j$th output bit positions are in the left (L) or right (R) half of the network. Unless
$I \ge \Omega(N \log N)$ (in which case we are done), at least $2 \log N$ of the last $6 \log N$ bit positions are
labeled L and at least $2 \log N$ are labeled R. If not, then (without loss of generality) there are at
most $2 \log N$ positions with the majority of the output bits in the left half of the network. This
means that at most

$$N \log N + 2N \log N + \sum_{\substack{\log N < j \le 7\log N \\ l_j < r_j}} \min(l_j, r_j)$$

of the $7N \log N$ output bits are output in the left half of the circuit. (At worst, all $N \log N$ of
the leading $\log N$ bits of each number are output in the left half, which accounts for the extra
$N \log N$ term in the preceding sum.) By assumption, this quantity must be at least $\frac{7}{2} N \log N$ and hence

$$\sum_{j = 1 + \log N}^{7 \log N} \min(l_j, r_j) \ge \frac{1}{2} N \log N.$$
4.2  Results  for  the  Bit  Model

In what follows, we assume that each number to be sorted consists of $k$ bits where, typically,
$k = O(\log N)$. If the $k$ bits of each number are input and output locally in the network, then
it is easy to show that $I \ge \Omega(kN)$ and thus that $AT^2 \ge \Omega(k^2 N^2)$ for sorting $N$ $k$-bit numbers
in the bit model. For large $k$, however, it is possible to construct circuits for which $AT^2 \le$
0(N2 k \og  N  \og2(^jj))  which  is  much  less  than  Q(k2N2)  (18).  Hence  for  large  k,  it  is  more 
efficient to input and output the bits of each number in vastly different parts of the network. If
the same were true for $k = O(\log N)$, then the intuition that $AT^2 \ge \Omega(N^2 \log^2 N)$ would not hold.
Nonetheless, the result is still true, as we show in the following theorem.

Theorem 4: The information transfer $I$ in any when and where oblivious network for sorting
$N$ $(7 \log N)$-bit numbers must be at least $\Omega(N \log N)$ in the bit model.

Proof: Consider a when and where oblivious network that sorts $N$ $(7 \log N)$-bit numbers,
and any partition of the network into left and right halves that evenly splits the location of the
output bits. As before, we let $x_{ij}$ denote the $j$th most significant bit of the $i$th input number
and $y_{ij}$ denote the $j$th bit of the $i$th sorted output number ($0 \le i < N$, $1 \le j \le 7 \log N$). By
definition, the partition splits the set of $y_{ij}$'s in half. (To be precise, the partition might not be
exactly half and half, but anything close will do, without changing the structure of the proof.)

The proof consists of proving several seemingly unrelated lower bounds for $I$ based on
various "worst-case" computations. At the end, we combine the lower bounds to show that
$I \ge \Omega(N \log N)$.

For $j$ in the range $\log N < j \le 7 \log N$, let $r_j$ be the number of $i$ such that $x_{ij}$ is input to
the right half of the network. Similarly, define $l_j$ to be the number of $i$ such that $x_{ij}$ is input to
the left half of the network. Clearly, $r_j + l_j = N$ for every $j$. We first show that

$$I \ge \frac{1}{2} \sum_{j = 1 + \log N}^{7 \log N} \min(l_j, r_j).$$

Construct a table with rows corresponding to "shifts" $p$ and columns corresponding to output
bits $y_{ij}$ for $0 \le i < N$ and $\log N < j \le 7 \log N$. By setting the first $\log N$ bits of the $i$th input
number to be $i + p \pmod{N}$ for each $i$, the network can be forced to shift (with wraparound)
the low order $6 \log N$ bits of each number by any amount $p$ ($0 \le p < N$). If for a particular
shift $p$ and output $y_{ij}$, the input bit that is sent to $y_{ij}$ by the shift ($x_{i-p,j}$) is not in the same
half as $y_{ij}$, then place a check in the corresponding position of the table. The number of checks
in any row is a natural (and standard) measure of the information transfer that will be required
to carry out the corresponding shift.
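
The table of checks can be built explicitly for a small instance. The following is a minimal sketch, not part of the original argument: the names `check_table` and `min_sum`, the random placement of bits into halves, and the sizes `N = 8` and `W = 5` (standing in for the low-order bit positions) are all illustrative assumptions. It counts the checks in every row and compares the fullest row with the sum of $\min(l_j, r_j)$ over the bit positions.

```python
import random

def check_table(in_left, out_left):
    """Count the checks in each row (shift p) of the table.

    in_left[i][j]  is True when input bit  x_ij is placed in the left half.
    out_left[i][j] is True when output bit y_ij is placed in the left half.
    A check is placed at (p, (i, j)) when the bit shifted into y_ij,
    namely x_{(i - p) mod N, j}, lies in the other half.
    """
    n = len(out_left)
    width = len(out_left[0])
    checks = []
    for p in range(n):
        count = 0
        for i in range(n):
            src = (i - p) % n
            for j in range(width):
                if in_left[src][j] != out_left[i][j]:
                    count += 1
        checks.append(count)
    return checks

def min_sum(in_left):
    """Sum over bit positions j of min(l_j, r_j)."""
    n = len(in_left)
    total = 0
    for j in range(len(in_left[0])):
        l_j = sum(1 for i in range(n) if in_left[i][j])
        total += min(l_j, n - l_j)
    return total

# Hypothetical small instance: 8 numbers, 5 low-order bit positions.
rng = random.Random(0)
N, W = 8, 5
in_left = [[rng.random() < 0.5 for _ in range(W)] for _ in range(N)]
out_left = [[rng.random() < 0.5 for _ in range(W)] for _ in range(N)]
checks = check_table(in_left, out_left)
bound = min_sum(in_left)
```

Because the average row holds at least `bound` checks, `max(checks)` is never smaller than `bound`, whatever placement is chosen.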

If $y_{ij}$ is in the right half, then there are $l_j$ checks in the corresponding column. Otherwise
there are $r_j$ checks in that column. In either case, the column corresponding to $y_{ij}$ contains at
least $\min(l_j, r_j)$ checks. Hence the table contains at least

$$N \sum_{j = 1 + \log N}^{7 \log N} \min(l_j, r_j)$$
achieve this bound for any $N$ and $T$ in the range $\Omega(\log N) \le T \le O(\sqrt{N})$ with a bounded-degree
network containing $O(N)$ nodes. In fact, a simple variant of the network described in Section 3.1
suffices. To prove this, we first consider the case when $T \le \frac{N^{1/3}}{2}$, and we show how to implement
an $\frac{N}{T} \times T$ columnsort procedure. The analysis is divided into two parts. In the first part we
show how the permutations in Steps 2, 4, 6 and 8 can be implemented in $O(T)$ steps with an
$O(N)$-node subnetwork that has $O(N^2/T^2)$ area for any $T$ in the range $\log N \le T \le \sqrt{N}$. In
the second part, we show how to sort the columns.

The permutations in Steps 2, 4, 6 and 8 have the nice property that numbers which are
adjacent in the input matrix (columnwise) are also adjacent in the output matrix (rowwise for
Steps 2 and 4, and columnwise for Steps 6 and 8). Hence for each of the four permutations, it
is easy to partition the input matrix and the output matrix into $N/T$ blocks of $T$ consecutive
matrix positions so that the numbers in each block of the input matrix are mapped to a
corresponding block of the output matrix. Thus by linking each of the $N/T$ blocks of $T$ inputs
to its corresponding block of outputs with a single wire, it is possible to complete the desired
permutation in $T$ parallel steps. (Unit-length wires are also used to connect adjacent positions of
the input and output matrices.) As there are only $O(N/T)$ non-unit-length wires, the resulting
network consumes at most $O(N^2/T^2)$ area, as claimed.

It remains to bound the area and time required for the sorting steps. Each sorting step
is accomplished with an $\frac{N}{T}$-number AKS sorting circuit. Applying the Bilardi-Preparata [8]
result that the $M$-input AKS circuit has $O(M^2)$ area, we find that the area of the $\frac{N}{T}$-number
sorting circuit is $O(N^2/T^2)$. Moreover, this circuit has $O(\frac{N}{T} \log \frac{N}{T}) \le O(N)$ nodes and is capable
(by pipelining) of sorting $T$ $\frac{N}{T}$-number columns in $O(T)$ steps (since $T \ge \log N$), as claimed.

The algorithm just described works for any $T$ in the range $\Omega(\log N) \le T \le \frac{N^{1/3}}{2}$. The
construction can be extended for $T \le O(\sqrt{N})$ by applying columnsort twice before plugging
in an AKS network. In particular, we first use the preceding argument to construct a network
with $O(M)$ nodes and $O(M^{4/3})$ area that is capable of sorting $M$ numbers in $O(M^{1/3})$ word
steps for arbitrary $M$. We then apply an $M \times \frac{N}{M}$ columnsort procedure to sort $N$ numbers
for $T$ in the range $\frac{N^{1/3}}{2} \le T \le N^{1/2}$ where $M = (N/T)^{3/2}$. Since $T \le \sqrt{N}$, $M \ge N^{3/4}$ and
thus $M \ge 2(\frac{N}{M} - 1)^2$. Hence the $M \times \frac{N}{M}$ columnsort will complete the sorting. The fixed
permutations are performed as before using $O(N)$ nodes, $O(T)$ steps and $O(N^2/T^2)$ area. The
$M$-number columns are sorted in turn by the single $O(M)$-node circuit just constructed. The
area of the sorting part of the $M$-number circuit is just $O(M^{4/3}) = O(N^2/T^2)$. The number of
nodes is $O(M) = O((N/T)^{3/2}) \le O(N)$ for $T \ge \frac{N^{1/3}}{2}$. The total time taken to sort all the columns
is $\frac{N}{M} \cdot O(M^{1/3}) = O(T)$. We summarize this result in the following theorem.
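
The parameter choices in this two-level construction can be checked numerically. The sketch below assumes the relation $M = (N/T)^{3/2}$ (the value for which the stated area and time identities hold) and ignores all constant factors; the function name `two_level_parameters` and the sample values of `n` and `t` are illustrative only.

```python
import math

def two_level_parameters(n, t):
    """Derived quantities for an M x (N/M) columnsort, with constants dropped."""
    m = (n / t) ** 1.5            # numbers per column, M = (N/T)^(3/2)
    columns = n / m               # number of columns, N/M
    return {
        "M": m,
        "columns": columns,
        "sort_time": columns * m ** (1 / 3),   # (N/M) * M^(1/3)
        "sort_area": m ** (4 / 3),             # M^(4/3)
    }

# Hypothetical sample point with N^(1/3) <= T <= N^(1/2).
n, t = 2 ** 24, 2 ** 10
params = two_level_parameters(n, t)
```

For this sample point the sorting time comes out equal to `t`, the sorting area equal to `(n / t) ** 2`, and `M` stays below `n`, which matches the three claims made in the text.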

Theorem 3: For any $T$ in the range $\log N \le T \le \sqrt{N}$, there is an $O(N)$-node,
bounded-degree network with $O(N^2/T^2)$ area that can sort $N$ numbers in $O(T)$ word steps.

As  before,  the  constants  in  the  construction  can  be  improved  to  the  point  where  the  network 
consists  of  N  nodes,  each  with  degree  at  most  three. 
Figure 10: An example of a set of inputs that forces the network to see $x_{ij}$ before outputting
$y_{rs}$.

any of the last $\log N$ bits of the numbers and thus there are still $N!$ distinct possible sets of
remaining output bits (depending, of course, on the values assigned to the first $\log N$ bits of each
input number). Since the network has already received all the unfixed input bits at this point,
there is only one possible set of remaining input bits. Hence, the network must have at least
$\log(N!) = \Omega(N \log N)$ bits of state. Since each node has only a constant number of bits of state,
this means that the network must contain at least $\Omega(N \log N)$ nodes. ∎
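
The counting step $\log(N!) = \Omega(N \log N)$ can be checked numerically. This is a small illustrative sketch; the helper name `log2_factorial` and the sample values of `n` are assumptions, and the lower envelope $\frac{N}{2}\log\frac{N}{2}$ is the standard bound obtained from $N! \ge (N/2)^{N/2}$.

```python
import math

def log2_factorial(n):
    """Return log2(n!), computed through the log-gamma function."""
    return math.lgamma(n + 1) / math.log(2)

# Bits of state needed to distinguish all n! output orders,
# bracketed by (n/2) * log2(n/2) below and n * log2(n) above.
state_bits = {}
for n in (16, 1024, 2 ** 20):
    state_bits[n] = (
        (n / 2) * math.log2(n / 2),
        log2_factorial(n),
        n * math.log2(n),
    )
```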


It is not difficult to prove the same lower bound for sorting $N$ numbers, each with $(1 + \epsilon) \log N$
bits for any constant $\epsilon > 0$. The bound does not hold for $k$-bit numbers when $k \le \log N$,
however, since, given enough time, it is possible to sort $N$ $(\log N)$-bit numbers with $O(N)$ nodes.
The exact number of nodes required to sort $N$ $k$-bit numbers for arbitrary $k$ has recently been
worked out by Siegel [25].


4.  Bounds  on  Information  Transfer  and  Area 

In this section, we establish tight bounds on the information transfer and wire area required to
sort in both the word and bit models. The lower bounds are the most difficult and are established
by lower bounding the information transfer. The upper bounds are established by upper bounding
the area and time. Information transfer, area and time are related by Thompson's fundamental
$AT^2 \ge \Omega(I^2)$ tradeoff. We refer the unfamiliar reader to [27] for a detailed explanation of area,
time and information transfer and for a simple proof of the tradeoff.


4.1  Results  for  the  Word  Model 

Thompson [27] showed that $I \ge \Omega(N)$ for the problem of sorting $N$ numbers in the word
model, and hence that $AT^2 \ge \Omega(N^2)$ for any such circuit. In what follows, we show how to
Corollary: When combined with the AKS sorting circuit, the construction in Theorem 1
gives an $O(N)$-node, bounded-degree network that sorts $N$ numbers in $O(\log N)$ word steps.

As Kruskal [13] observed, the constants in the preceding construction can be improved to the
point where the $N$-number sorting network is degree-3 and has at most $N$ nodes. To prove this,
first note that every bounded-degree node can be locally expanded into a constant number of
nodes each with degree at most 3. The resulting network still has $cN$ nodes for some constant $c$
and still sorts $N$ numbers in $O(\log N)$ steps. The proof is completed by observing that the same
network can actually sort a list of $cN$ numbers by replacing each number with a list of $c$ sorted
numbers and each comparison of two numbers with a merging and halving of two $c$-number lists.
The proof that the resulting algorithm actually sorts the $cN$ numbers is not hard to work out.
(For example, see [5] or Problem 38 of Section 5.3.4 in [12].)
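
The replacement of numbers by sorted lists can be illustrated on any comparator network. The sketch below is an assumption-laden stand-in: it uses a simple odd-even transposition network rather than the AKS circuit, and the names `transposition_network` and `sort_blocks` are hypothetical. Each wire carries a sorted list of `c` numbers, and each comparator merges its two lists and splits the result into a low half and a high half.

```python
import heapq
import random

def transposition_network(n):
    """Comparators (i, i+1) of an n-wire odd-even transposition sorting network."""
    return [(i, i + 1) for rnd in range(n) for i in range(rnd % 2, n - 1, 2)]

def sort_blocks(values, n, c, network):
    """Sort n*c values with an n-wire comparator network.

    Every wire holds a sorted list of c values. A comparator (lo, hi)
    merges the two lists, keeps the c smallest on wire lo and the
    c largest on wire hi.
    """
    assert len(values) == n * c
    wires = [sorted(values[k * c:(k + 1) * c]) for k in range(n)]
    for lo, hi in network:
        merged = list(heapq.merge(wires[lo], wires[hi]))
        wires[lo], wires[hi] = merged[:c], merged[c:]
    return [x for wire in wires for x in wire]

# Hypothetical sizes: a 6-wire network sorting 6 * 4 = 24 numbers.
rng = random.Random(0)
data = [rng.randrange(1000) for _ in range(24)]
result = sort_blocks(data, n=6, c=4, network=transposition_network(6))
```

The network itself is unchanged; only the meaning of a "number" and of a "comparison" is widened, which is why the node count stays the same while the amount of data sorted grows by the factor `c`.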

Because every number to be sorted must be input before the rank of any number can be
determined, all $N$ numbers must be input and remembered before any of them can be output. If
each node can remember at most a constant number of numbers, this means that at least $\Omega(N)$
nodes are required to sort $N$ numbers in the word model. This kind of argument is used more
carefully in the following section, where we prove an $\Omega(N \log N)$-node lower bound for the bit
model.

3.2  Results  for  the  Bit  Model 

As was mentioned in the introduction, the $N$-input AKS sorting circuit can also be used as an
$O(N \log N)$-node bounded-degree network for sorting $N$ numbers in $O(\log N)$ bit steps. In what
follows, we show that no fewer nodes could have been used (up to a constant factor), no matter how
much time is allowed. The proof applies only to network algorithms that are when and where
oblivious (i.e., to algorithms for which the time and location of each input bit and output bit is
fixed ahead of time, so as not to be dependent on the value of the inputs or the running of the
algorithm). In addition, the inputs are supplied just once.

Theorem 2: Any when and where oblivious network capable of sorting $N$ $(2 \log N)$-bit
numbers in the bit model must have $\Omega(N \log N)$ nodes.

Proof:  The  basic  idea  is  to  show  that  a  large  portion  of  the  input  bits  must  be  presented  to 
the  network  before  much  of  the  information  can  be  output,  thus  forcing  the  network  to  remember 
a  large  number  of  bits. 

Let $x_{ij}$ denote the $j$th most significant bit of the $i$th input word and $y_{ij}$ denote the $j$th bit
of the $i$th sorted word ($0 \le i < N$, $1 \le j \le 2 \log N$). We first show that any input/output
schedule for a correct algorithm must input $x_{ij}$ before outputting $y_{rs}$ whenever $s > j$. If this
were not the case, then consider the action of the algorithm on the input numbers (written in
binary) shown in Figure 10. Every input bit is specified except for $x_{ij}$. If $x_{ij} = 0$, then the $r$th
sorted number is (in binary) all zeros except for a one in the $j$th position. Since $s > j$, this
means that $y_{rs} = 0$. If $x_{ij} = 1$ on the other hand, then the $r$th sorted number is (in binary) all
zeros except for a one in the $j$th and $s$th positions, and $y_{rs} = 1$. Hence, there is no way that the
algorithm can always correctly output $y_{rs}$ before seeing $x_{ij}$.

Because the circuit is when oblivious, the preceding argument means that any sorting circuit
must (in particular) input $x_{ij}$ for all $i < N$ and $j \le \log N$ before outputting $y_{rs}$ for any $r < N$
and $s > \log N$. Consider inputs for which the last $\log N$ bits of the $i$th input number are fixed
to equal $i$ for each $i < N$. Also consider the step of the algorithm in which the last of the first
$\log N$ bits of each number is input to the network. At this point the algorithm has not output


Figure 8: A more effective, diagonalizing permutation for Step 4 of columnsort.


for all but the extreme cases when $i + j \ge r$ and $i + j < s - 1$. This would mean that every number
would be at most $s(s - 1)/2$ from its correct position after Step 4, and thus that $r \ge s(s - 1)$
would be sufficient. In addition, the constraints that $rs = N$ and $s \mid r$ can be removed provided
that $r$ is one larger.

3.  Bounds  on  the  Number  of  Nodes 


In this section, we prove that $N$ nodes are sufficient to sort $N$ numbers in the word model
and that $\Theta(N \log N)$ nodes are necessary to sort $N$ $k$-bit numbers in the bit model whenever
$k \ge (1 + \epsilon)\log N$ for some constant $\epsilon > 0$.


3.1  Results  for  the  Word  Model 

In the following theorem, we show how to use columnsort to convert a family of $f(N)$-level
circuits for sorting $N$ numbers into a family of bounded-degree, $O(N)$-node circuits that can
sort $N$ numbers in $O(f(N))$ word steps. In the theorem, we choose $f(N) = o(N^{1/3})$ so that
$r \ge 2(s - 1)^2$ when columnsort is applied, where $r = \frac{N}{f(N)}$ and $s = f(N)$. As a consequence, we
can transform the AKS circuit for Problem 1 into a solution for Problem 2.

Theorem 1: Given a monotone function $f$ such that $f(N) = o(N^{1/3})$ for all $N$ and a family
of $f(N)$-level circuits for sorting $N$ numbers, one can construct a family of bounded-degree,
$O(N)$-node networks that can sort $N$ numbers in $O(f(N))$ word steps.

Proof: Select a circuit from the family that sorts $\frac{N}{f(N)}$ numbers. Since $f$ is monotone, this
circuit has depth $f(\frac{N}{f(N)}) \le f(N)$ and at most $N$ nodes. By pipelining the columns of an
$\frac{N}{f(N)} \times f(N)$ matrix through the circuit, the columns of the matrix can all be sorted in $2f(N)$
word steps. By simply hard-wiring the four fixed permutations used in Steps 2, 4, 6 and 8 of
columnsort, a matrix of $N$ numbers can be sorted in $O(f(N))$ word steps by columnsort. The
network is pictured in Figure 9 for the special case when $f(N) = O(\log N)$. The total number of
processors used is clearly $O(N)$ and each processor has bounded degree. Moreover, the processors
need to have only a finite amount of state information aside from the ability to store and compare
$\Theta(\log N)$-bit numbers. ∎
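
The $2f(N)$ figure for pipelining can be seen in a small simulation. This is an illustrative sketch under the assumption that one column enters the circuit per word step and advances one level per step; the function name `pipeline_steps` is hypothetical.

```python
def pipeline_steps(num_columns, depth):
    """Count the word steps needed to push all columns through a depth-level circuit.

    stages[d] holds the index of the column currently being processed by
    level d (or None). In each step a new column enters level 0, every
    other column moves one level deeper, and the column processed by the
    last level during that step is finished.
    """
    stages = [None] * depth
    next_column = 0
    finished = 0
    steps = 0
    while finished < num_columns:
        entering = next_column if next_column < num_columns else None
        if entering is not None:
            next_column += 1
        stages = [entering] + stages[:-1]
        steps += 1
        if stages[-1] is not None:
            finished += 1
    return steps

# With f(N) columns and a circuit of depth at most f(N), the total is
# f(N) + f(N) - 1 steps, which is below 2 f(N).
steps_for_five = pipeline_steps(num_columns=5, depth=5)
```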
Hence the position of $x$ after Step 4 is at most


$$(s - 1)^2 - j(s - 1) \le (s - 1)^2$$


beyond  its  correct  position. 


A symmetric argument shows that the true rank of $x$ is at most $is + sj$. Hence the position
of $x$ after Step 4 is at most $j(s - 1) \le (s - 1)^2$ short of its correct sorted position. Thus we have
established that every number is within $(s - 1)^2$ of its correct position after Step 4. When $s = 2$,
we have the special case result for odd-even merge. When $s = r = \sqrt{N}$, we see that virtually
nothing has been gained, which is why the algorithm doesn't work for square matrices.

We now show that Steps 5-8 are sufficient to complete the sorting. For simplicity, we assume
in what follows only that every number is within $\lfloor r/2 \rfloor$ of its correct sorted position. Since $r \ge
2(s - 1)^2$, we are always guaranteed that this condition is met after completion of Step 4.

After Step 4, every number that belongs in the top half of column $j$ ($0 \le j < s$) when sorted
is in column $j$ or in the bottom half of column $j - 1$. Similarly, every number that belongs in the
bottom half of column $j$ is in column $j$ or the top half of column $j + 1$. Otherwise, some number
would be more than $\lfloor r/2 \rfloor$ away from its correct position. After Step 5, every number that belongs
in the top half of column $j$ is in the top half of column $j$ or the bottom half of column $j - 1$.
Were this not true and were such a number $x$ to be in the bottom half of column $j$, then $x$, every
number ahead of $x$ in column $j$ and, of course, every number in columns $0, 1, \ldots, j - 1$ would have
to have rank less than $rj + r/2$, which is impossible. Alternatively, were $x$ in the top of column
$j - 1$ at this point, then there could be at most $r(j - 1) + \frac{r}{2} - 1 + \frac{r}{2} = rj - 1$ numbers of rank less
than or equal to $rj - 1$, which is also impossible. (Recall that there are $rj$ such numbers since
the smallest number has rank zero.) This total is calculated by counting the $r(j - 1)$ numbers
in columns $0, 1, \ldots, j - 2$, the $\frac{r}{2} - 1$ or fewer numbers ahead of $x$ in column $j - 1$ and the $\frac{r}{2}$ or
fewer numbers in column $j$ that could belong in column $j - 1$. Using identical arguments, we can
also show that after Step 5, every number that belongs in the bottom half of column $j$ is in the
bottom half of column $j$ or the top half of column $j + 1$.

Combining the two facts in the preceding paragraph, we find that every number that should
be in the bottom half of column $j$ or the top half of column $j + 1$ when sorted is in one of these
two half-columns after Step 5. Hence, Steps 6-8 complete the sorting. This completes the proof
that columnsort works.
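
The whole procedure can be sketched in a few lines. The code below is an illustration rather than the paper's network: it assumes the usual reading of the eight steps (odd steps sort columns; Step 2 picks the entries up in column-major order and lays them down in row-major order; Step 4 is the inverse; Steps 6 and 8 shift and unshift by half a column, with $\pm\infty$ padding), and the names `steps_1_to_4` and `columnsort` are hypothetical. The matrix is stored as a list of `s` columns of `r` entries, and the sorted order is column-major.

```python
import math
import random

def sort_columns(cols):
    """Sort every column independently (Steps 1, 3, 5 and 7)."""
    return [sorted(c) for c in cols]

def steps_1_to_4(values, r, s):
    """Run Steps 1-4 and return the matrix as a flat column-major list."""
    cols = sort_columns([values[k * r:(k + 1) * r] for k in range(s)])      # Step 1
    flat = [x for c in cols for x in c]                                     # column-major pickup
    cols = [[flat[row * s + col] for row in range(r)] for col in range(s)]  # Step 2
    cols = sort_columns(cols)                                               # Step 3
    # Step 4: pick up in row-major order, lay down in column-major order.
    return [cols[col][row] for row in range(r) for col in range(s)]

def columnsort(values, r, s):
    """Sort r*s values arranged as an r x s matrix."""
    assert len(values) == r * s and r % s == 0 and r >= 2 * (s - 1) ** 2
    flat = steps_1_to_4(values, r, s)
    cols = sort_columns([flat[k * r:(k + 1) * r] for k in range(s)])        # Step 5
    flat = [x for c in cols for x in c]
    half = r // 2
    padded = [-math.inf] * half + flat + [math.inf] * (r - half)            # Step 6
    cols = sort_columns([padded[k * r:(k + 1) * r] for k in range(s + 1)])  # Step 7
    flat = [x for c in cols for x in c]
    return flat[half:half + r * s]                                          # Step 8

# Hypothetical example: 16 numbers as an 8 x 2 matrix.
demo = columnsort(list(range(16, 0, -1)), r=8, s=2)
```

Calling `steps_1_to_4` on distinct values also makes it possible to measure how far each entry sits from its final rank after Step 4 and to compare that distance with $(s - 1)^2$.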

Columnsort provides an efficient way to sort $N$ numbers given that we know how to sort $r$
numbers where $rs = N$, $s \mid r$ and $r \ge 2(s - 1)^2$. For example, 24 numbers can be sorted by
repeatedly sorting subsets of 8 numbers. It would be interesting to know how much the constraint
on the size of $r$ can be relaxed without radically changing the algorithm. Some improvement is
definitely possible. For example, if Step 4 were replaced with the diagonalizing permutation
shown in Figure 8, it would be necessary only that $r \ge s(s - 1)$. This is because a number in the
$i,j$ position after Step 3 would then correspond to a rank of

$$is + sj - s(s - 1)/2 + s - 1 - j$$
At  first  glance,  it  seems  impossible  that  such  an  algorithm  works.  For  example,  if  the  matrix
were square (i.e., if $r = s$, which is not allowed), then we would essentially just be sorting rows
and  columns  which  is  well-known  to  leave  entries  arbitrarily  far  away  from  their  correct  sorted 
position. 

A far better intuition comes from the case when $r = N/2$ and $s = 2$. In this special case, we
have precisely odd-even merge. In odd-even merge, a list of $N$ numbers is divided into 2 sublists,
each with $N/2$ numbers. This division corresponds to entering the numbers in an $N/2 \times 2$ matrix:
each column of the matrix contains a sublist. The two sublists are then sorted, as is done in Step
1 of columnsort. Then the odd-index numbers in each sublist are combined to form a new sublist,
as are the even-index numbers. This corresponds to the transpose (or "unshuffle") operation in
Step 2 of columnsort. Next, each sublist is sorted, as is done in Step 3 of columnsort. (In odd-even
merge, this sorting step is accomplished with a recursive merge.) After sorting, the sublists are
shuffled together, as is done in Step 4 of columnsort. At this point, every number is within one
of its correct position, so each number is compared to its neighbors and (possibly) interchanged,
thus completing the sorting. The same maneuver is accomplished in a rather brute-force way by
columnsort. In Step 5, all but the top and bottom numbers in each column are compared to their
neighbors by sorting the columns. Steps 6-8 ensure that comparisons are made between numbers
at the bottom of one column and the top of the next column.
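
For comparison, the merge just described can be written recursively. This is a minimal sketch of the classical odd-even merge for two sorted lists of equal power-of-two length, not the paper's own code; the function name `odd_even_merge` is an assumption.

```python
import random

def odd_even_merge(a, b):
    """Merge two sorted lists of equal length n, where n is a power of two."""
    n = len(a)
    assert n == len(b) and n >= 1
    if n == 1:
        return [min(a[0], b[0]), max(a[0], b[0])]
    # Unshuffle: even-index entries of both lists form one subproblem,
    # odd-index entries form the other, and each is merged recursively.
    evens = odd_even_merge(a[0::2], b[0::2])
    odds = odd_even_merge(a[1::2], b[1::2])
    # Shuffle the two results together, then compare each interior
    # neighbouring pair once and interchange if needed.
    out = [evens[0]]
    for k in range(n - 1):
        out.append(min(odds[k], evens[k + 1]))
        out.append(max(odds[k], evens[k + 1]))
    out.append(odds[n - 1])
    return out

# Hypothetical example: two sorted lists of 8 numbers each.
rng = random.Random(3)
left = sorted(rng.randrange(100) for _ in range(8))
right = sorted(rng.randrange(100) for _ in range(8))
merged = odd_even_merge(left, right)
```

The final loop is the single round of neighbor comparisons mentioned in the text: after the shuffle, every entry is at most one position away from where it belongs.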

The action of columnsort for arbitrary $r \ge 2(s - 1)^2$ is very much like that for odd-even
merge. After Step 4, we will be guaranteed that every number is within $(s - 1)^2$ of its correct
sorted position. Then Steps 5-8 are sufficient to finish the sorting. We prove these two facts in
what follows.

Consider a number $x$ that is in position $i, j$ of the matrix after Step 3. A simple calculation
shows that $x$ is sent to a position in Step 4 that corresponds to a rank of $p = is + j$ in the sorted
list. (Recall our convention that the smallest number has rank zero.) From the position of $x$ after
Step 3, we know that $x$ is greater than or equal to at least $i + 1$ numbers in the $j$th column of
the matrix after Step 2. Let $a_k$ denote the number of these $i + 1$ numbers that originally come
from column $k$ of the matrix (i.e., before Step 2 transposed the matrix). By definition,

$$i + 1 = \sum_{k=0}^{s-1} a_k.$$

Since only the $j$th and every $s$th number thereafter of the (sorted) $k$th column after Step 1
appear in the $j$th column after Step 2, this means that $x$ is greater than or equal to at least
$(a_k - 1)s + j + 1$ numbers in the $k$th column of the matrix after Step 1. Hence the true rank of
$x$ is at least

$$\sum_{k=0}^{s-1} \left[(a_k - 1)s + j + 1\right] - 1.$$

Substituting $i + 1$ for $\sum a_k$ and simplifying, we find that the true rank of $x$ is at least

$$is + sj - (s - 1)^2.$$
Figure 7: The step by step application of columnsort to the matrix in Figure 4.